Method and system for sample size determination for database optimizers

ABSTRACT

A system and method for determining an adequate sample size for statistics collection is disclosed. A mechanism for automatically determining an adequate sample size for both statistics and histograms is provided. The sample size determination is accomplished via an iterative approach where the process starts with a small sample, and for each attribute which may need more data, the sample size is increased while restricting the information collected to only those attributes that require the larger sample.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a divisional of U.S. application Ser. No.09/872,565, filed on May 31, 2001, which is hereby incorporated byreference in its entirety for all purposes as if fully set forth herein.

COPYRIGHT NOTICE

[0002] A portion of the disclosure of this patent document containsmaterial which is subject to copyright protection. The copyright ownerhas no objection to the facsimile reproduction by anyone of the patentdocument or patent disclosure, as it appears in the Patent and TrademarkOffice patent file or records, but otherwise reserves all copyrightrights whatsoever.

BACKGROUND AND SUMMARY

[0003] The present invention relates to the field of computer systems.More particularly, the invention relates to a method and system fordatabase optimization.

[0004] A “query” is a statement or collection of statements that is usedto access a database. Specialized query languages, such as thestructured query language (“SQL”) are often used to interrogate andaccess a database. Many types of queries include at least the following.First, the identity of the database object(s) being accessed to executethe query (e.g., one or more named database tables). If the queryaccesses two or more database objects, what is the link between theobjects (e.g., a join condition or column). The typical query alsodefines selection criteria, which is often referred to as a matchingcondition, filter, or predicate. Lastly, a query may define which fieldsin the database object are to be displayed or printed in the result.

[0005] Optimization is the process of choosing an efficient way toexecute a query statement. Many different ways are often available toexecute a query, e.g., by varying the order or procedure in whichdatabase objects and indexes are accessed to execute the query. Theexact execution plan or access path that is employed to execute thequery can greatly affect how quickly or efficiently the query statementexecutes.

[0006] Cost-based optimization is an approach in which the executionplan is selected by considering available access paths to determine thelowest cost approach to executing the query. In one approach, cost-basedoptimization consists of the following steps: (1) generating a set ofpotential execution plans for the database statement to be executed; (2)estimating the cost for each execution plan; and (3) comparing the costsof the execution plans to identify the execution plan having the lowestcost. Conceptually, the term “cost” relates to the amount of a givenresource or set of resources needed to process an execution plan.Examples of such resources include I/O, CPU time, and memory. Variousmeasures may be used to identify the execution plan having the lowestcost. For example, the cost-based approach may be used to identify theexecution plan providing either the best throughput or the best responsetime.

[0007] Many database optimizers use statistics to calculate the“selectivity” of predicates and to estimate the cost of performingdatabase operations. Statistics quantify characteristics of database andschema objects, such as the data distribution and storagecharacteristics of tables, columns, indexes, and partitions. Selectivityrefers to the proportion or fraction of a database object correspondingto a query predicate. An optimizer uses the selectivity of a predicateto estimate the cost of a particular access method and to determineoptimal join order.

[0008] Statistics should be gathered on a regular basis to provide theoptimizer with needed information about schema objects. Significantcosts may be incurred to collect and maintain statistics for databaseobjects. To reduce this collection cost and improve performance, manydatabase systems use data sampling to reduce the amount of data thatmust be collected to provide statistics used by the optimizer. With datasampling, only a portion of the rows within a database table is accessedto generate a set of statistics for the entire table or column. Theresults of the data sampling is thereafter scaled upward to extrapolatethe statistics values for the entire population. However, different datadistributions may require different sample sizes in order to obtainaccurate statistics. If a too-small sample size is selected, then thestatistics may be inaccurate, which could lead to sub-optimal executionplans and poor query performance. If a too-large sample size isselected, then resources are wasted to collect more data than is neededto provide accurate statistics. Consequently, it is desirable to useonly the minimal sample size needed for accurate statistics collection.

[0009] In addition to statistics, optimizers often use data valuehistograms to select an optimal execution plan. A data value histogramis a structure that provides estimates of the distribution of datavalues in a database object. A histogram partitions the data objectvalues in a set of individual “buckets”, so that all valuescorresponding to a given range fall within the same histogram bucket.The histogram provides information that is helpful in determining theselectivity of a predicate that appears in a query.

[0010] In a height-balanced histogram, each bucket of the histogramcorresponds to an equal number of rows in a table. The boundaries of thebuckets shrink or grow so that all buckets maintain the same number ofentries. The useful information provided by the histogram is the rangeof values that corresponds to each bucket, e.g., the endpoints for eachbucket of the histogram. Consider a column C with values between 1 and100 in which the column data is uniformly distributed. FIG. 1a shows aheight-balanced histogram plotted for this column having ten buckets.The number of rows in each bucket of the histogram is one-tenth thetotal number of rows in the table. Since the data. values are evenlydistributed, the endpoints of the buckets are also evenly spaced.

[0011] Now consider a second column having 100 rows for which columndata values are not evenly spaced, in which ninety rows contain thevalue “1” and the other ten rows contain a column value between 2 and100. FIG. 1b shows this column plotted in a height balanced histogramoften buckets. Since ninety percent of the rows have the value “1”, nineof the ten buckets in the histogram of FIG. 1b also correspond to thevalue “1”. Thus, it can be seen that nine of the ten buckets in thehistogram of FIG. 1b have endpoints that end in the number “1”. The lastbucket 106 corresponds to the ten rows in the column having data valuesbetween “2” and “100”. In operation, such a histogram provides anoptimizer with instant knowledge of the selectivity of particular valuesof a column. This selectivity information can be used, for example, todetermine whether either a full table scan or an index access providesthe most efficient path to satisfying a query against the database tablecorresponding to the histogram.

[0012] Other types of histograms also exist. For example, anotherhistogram used by optimizers is the width-balanced histogram, in whichcolumn data is divided into a number of fixed, equal-width ranges andthe histogram is organized to count the number of values falling withineach range.

[0013] A histogram may not always provide an appreciable benefit. Forexample, a histogram may not be useful for a data set having uniformdata distribution, since it can be assumed that all data within that setare equally distributed and therefore the histogram will not provide anyadditional useful information. If a histogram is desired, a significantamount of resources may be needed to collect, maintain, and usehistograms. Therefore, it makes sense to only create, store, and/or usea histogram when such a histogram provides benefits greater than theexpense of the histogram. However, conventional database systemstypically rely upon the skill and knowledge of individual databaseadministrators to manually decide whether histograms should or shouldnot be collected for columns in the database. While guidelines may beprovided to assist this decision-making, this manual process byadministrators often leads to inconsistent and erroneous decisionsresulting in the collection and storage of unneeded histograms, or thefailure to collect histograms that could provide more efficient queryprocessing.

[0014] The present invention provides a method and system fordetermining when to collect histograms. In an embodiment, the inventionprovides a mechanism for automatically deciding when to collecthistograms upon request from the user. This decision is based on thecolumns the user is interested in, the role these columns play in thequeries as submitted to the system, and the underlying distribution forthese columns, e.g., as seen in a random sample. The user specifieswhich columns are of interest, and the database is configured to collectcolumn usage information that describes how each column is being used inthe workload. This column usage information could be stored in memoryand periodically flushed to disk. Given a set of potential columns, thedistribution of those columns is viewed in combination with the usageinformation to determine which columns should have histograms.

[0015] The invention also provides a system and method for determiningan adequate sample size for statistics collection. In one embodiment,the invention provides a mechanism for automatically determining anadequate sample size for both statistics and histograms. This isaccomplished via an iterative approach where the process starts with asmall sample, and for each attribute which may need more data, thesample size is increased while restricting the information collected toonly those attributes that require the larger sample.

[0016] Further details of aspects, objects, and advantages of theinvention are described below in the detailed description, drawings, andclaims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0017] The accompanying drawings are included to provide a furtherunderstanding of the invention and, together with the DetailedDescription, serve to explain the principles of the invention.

[0018]FIGS. 1a and 1 b show example histograms.

[0019]FIG. 2 shows a flowchart of a process for determining sample sizefor statistics collection according to an embodiment of the invention.

[0020]FIG. 3 shows a flowchart of a process for histogram determinationaccording to an embodiment of the invention.

[0021]FIG. 4 shows a flowchart of an alternate process for histogramdetermination according to an embodiment of the invention.

[0022]FIGS. 5 and 6 are diagrams of system architectures with which thepresent invention may be implemented.

DETAILED DESCRIPTION

[0023] The invention is described with reference to specificembodiments. It will, however, be evident that various modifications andchanges may be made thereto without departing from the broader spiritand scope of the invention. The reader is to understand that thespecific ordering and combination of process actions shown in theprocess flow diagrams and system components in component diagramsdescribed herein are merely illustrative, and the invention can beperformed using different, additional, or differentcombinations/ordering of process actions and components. For example,the invention is particularly illustrated herein with reference tospecific database objects such as tables, columns, and rows, but it isnoted that the inventive principles are equally applicable to othertypes of database objects. The specification and drawings are,accordingly, to be regarded in an illustrative rather than restrictivesense.

[0024]FIG. 2 shows a flowchart of a process for-determining sample sizesfor statistics collection, according to an embodiment of the invention.At step 200, an initial sample size is selected for statisticscollection. In an embodiment, the selected sample size could beexpressed as a percentage of the rows in a table. Other measures couldbe used to express sample size, such as an exact number of rows for thetable.

[0025] At step 202, rows in the table are identified based upon theinitially selected sample size. In an embodiment, this is accomplishedby attempting to select the number of rows in the table corresponding tothe percentage value used to express the initially selected sample size.For example, consider if the initially selected sample size is 20% andthe number of rows in the table is 1000. For this example, the expectednumber of rows to be identified in step 202 is (1000)*(0.20)=200 rows.One way to achieve this is to provide a function (e.g., a “samples( )”function) that chooses rows from the table based upon the selectedpercentage value, in which each row is individually laced with a givenpercentage chance of being selected. If the sampling percentage is 20%,then each row in the column individually faces a 20% chance of beingselected. In this manner, over the entire table, it is likely thatapproximately 20% of the rows in the table will be selected. The exactrows to be selected will be subject to a certain amount ofrandomization, and it is possible that the exact number of rows actuallyselected will be greater or smaller than 20%. The statistics gatheredbased upon this sampling can later be used to extrapolate statistics forthe entire table.

[0026] At step 204, a determination is made regarding whether the numberof sample rows identified in step 202 is adequate. In an embodiment,this step is performed by determining whether statistics for theidentified rows using the initial sample size can be adequately scaledupward to extrapolate accurate statistics for the entire table. Oneapproach to accomplishing this is to compare the selected number of rowswith a minimum value for the particular statistics for which sampling isperformed. For example, consider if the statistic being addressed by thesampling is the “Number of Rows in Table.” A minimum value, such as“2500” can be established for this type of statistic. If the identifiednumber of rows from step 202 is less than 2500 rows, then the samplesize or sample percentage is increased (208), and steps 202 and 204 arerepeated until the minimum sample size is achieved. If the number ofrows identified in step 202 meets or exceeds the minimum value, then thesample size is adequate (206).

[0027] It is noted that different statistics may require differing teststo determine whether rows sampled during step 202 can be adequatelyscaled upward to provide statistics for the entire table. The followingare additional examples of statistics used for database optimizers: 1)average column length; 2) number of distinct values in column; 3)minimum value in column; and 4) maximum value in column. For the averagelength, minimum, and maximum statistics, the number of rows sampledduring step 202 can be compared to another minimum value, e.g., “919”,to determine whether the sample size is adequate.

[0028]FIG. 4 is a flowchart of a process for determining whether ahistogram should be collected or saved according to an embodiment of theinvention. At step 402, column usage is tracked during workloadsexecuted against a table. In an embodiment, this is accomplished bymarking individual columns while executing queries against thosecolumns. A recordation is made regarding the type of predicate that isevaluated against a column. For example, this type of recordation trackswhether, and how often, an equality, range or like predicate isevaluated against a column. At step 404, a determination is made whetherdata skew exists for the column values. The predicate type for aparticular column and the data skew within that column are analyzed todetermine whether a histogram should be collected for the column (406).

[0029] In an embodiment, if equality and/or equijoin predicates areevaluated against a column and the column data exhibits non-uniformvalue repetition, then a histogram should be collected and/or saved forthe column. If like or range predicates are evaluated against a columnand the column data exhibits non-uniformity in range, then a histogramshould be collected and/or saved. The meaning of “non-uniform valuerepetition” and “non-uniformity in range” is defined below according toone embodiment of the invention.

[0030] Instead of, or in addition to, the process of FIG. 4, the processshown in FIG. 3 can be used to determine whether a histogram should becollected or saved for a table column. If data sampling is beingperformed, then a determination is made at step 300 whether the samplesize is adequate. If not, then the sampling rate is adjusted upward tocollect an adequate sample size. In one embodiment, if the number ofnon-null column values in the sample is less than 2500, then the samplerate is increased to provide more samples.

[0031] At step 302, a determination is made regarding the expectednumber of buckets for the histogram. At step 304, data uniformity/rangeskew is evaluated for the data sample values with respect to theexpected histogram buckets. In an embodiment, this is accomplished bygathering frequency and histogram information for the column values. Forexample, a simple query can be executed to collect distinct values andtheir counts for a column. At step 306, a determination is made whetherthe column values are uniform. In an embodiment, this determinationchecks whether any values repeat more than other values in the column,or whether there are any range skews in the data. If so, then the datais non-uniform. If the data is non-uniform, then a histogram iscollected for the column (310). If the column data is uniform then thevalues in the column are considered to be equally distributed;therefore, either no histogram is collected or a previously collectedhistogram is not saved/used (308).

[0032] Illustrative Embodiment

[0033] The present section describes pseudocode to implement anillustrative embodiment of the invention. Initially, the illustrativeembodiment begins by building an array of columns needing statistics.Then, the illustrative process primes data structure bits that representwhich statistics need to be gathered for which columns. The process mayre-invoke this procedure when auto-increasing the sample size to re-setthe still necessary bits for statistics requiring an increased samplesize. The process creates a list of query statements needed to gatherall statistics—this “select” list may be reused across all partitionsand subpartitions. These queries are executed to gather statistics forevery table/partition object requested. Finally, the illustrativeprocedure sets the gathered statistics in a data dictionary. The samplesize used while gathering statistics is automatically adjusted duringthe procedure to ensure adequate sample size for the particularstatistics being collected. The following comprises high-levelpseudocode for the illustrative embodiment:

[0034] if auto sample size

[0035] do a quick row count estimate of the object

[0036] initialize all of the statistics bits for the columns

[0037] while there are still unresolved statistics

[0038] generate the from clause for the query

[0039] execute the basic query

[0040] work on all of the desired histograms

[0041] evaluate the basic statistics

[0042] if some statistics are not ready (need larger sample size)

[0043] construct a new select list using the current statistics bits

[0044] The following table defines variables used in the illustrativepseudocode. TABLE 1 Term Definition P Sampling fraction (between 0.0 and1.0) N Number of rows in the table Avg Average column length statisticmin/max Minimum/maximum column value statistic Nv Number of null columnvalues statistic Ndv Number of distinct values statistic S Number ofrows seen in the sample Snnv Number of non-null column values seen inthe sample Sndv Number of distinct values seen in the sample mnb Maximumnumber of buckets allowed in the histogram

[0045] The following is pseudocode for the top level routine forgathering statistics and for determining whether a histogram should becollected:

[0046] estimate n—use block sample count(*) on user's table

[0047] initialize_gatherbits( )

[0048] while (some statistics still need to be (re)collected)

[0049] generate_from_clause( )—includes possible materialization of newtable

[0050] execute_basic( )

[0051] execute_hist( )

[0052] evaluate_basic( )

[0053] This top level pseudocode executes the main functions thatcomprise the statistics gathering processes according to one embodimentof the invention.

[0054] The initialize_gather_bits( ) function is a procedure which takesin the array of columns for which statistics need to be collected andsets bits representing which statistics are needed. These bits are laterindividually cleared after gathering statistics and evaluating theirprobable accuracy. The function is called initially so that the selectlist can be generated from it for all objects. It is later called againto reset the bits for each new object (e.g.,table/partition/subpartition).

[0055] The process takes in the list of columns (including statisticsbits) and creates a select list to be used to gather basic statistics(not including histograms). In an embodiment, the process ensures thatthe select list does not contain more functions than the server canhandle at once, e.g., only 256 distinct aggregates. If it cannot fitthem all in one statement, the caller is informed of which columns areincluded.

[0056] The generate_from_clause( ) function has the responsibility ofgenerating the FROM clause for the basic query and all the histogramqueries. In one embodiment, each histogram uses a separate query, andtherefore employs a separate scan of the data. If many scans arerequired and involve sampling the underlying table, it may be beneficialto materialize the sample once and then pass over that multiple times.If that is the case, this procedure in one embodiment will generate atemporary table and populate it with a sample.

[0057] The execute_basic( ) function handles the basic statistics queryto parse, execute, and fetch information from the database objects. Inan embodiment, the query is generated earlier in the process and thecolumn array provides sufficient information to infer the select list.

[0058] The evaluate_basic( ) procedure looks at the fetched basicstatistics and tries to scale them up. This procedure clears the bitsfor all statistics that are acceptably scaled, and suggests a largersampling percentage if some statistics need to be recollected.

[0059] The execute hist( ) procedure is the driver for collecting andevaluating the histogram statistics. This function looks over allcolumns that are marked as possibly needing histograms. It then collectsa frequency histogram, a height histogram, or both, depending upon theexpected number of distinct values and the requested number of buckets.

[0060] The following comprises pseudocode for an embodiment of theinitialize_gather_bits( ) function, in which the following statisticsare collected: nv, ndv, min, max, and avg. for each column the userrequested mark a bit to indicate need to collect the followingstatistics: nv, ndv, min, max, avg if there is a need to collect ahistogram (see results of Fig. 4) mark a bit indicating this

[0061] The following comprises pseudocode for an embodiment of thegenerate_from_clause( ) function, which establishes the initial samplingfraction p for the statistics gathering process: if first time, set p to5500 / n -- try for 5500 rows otherwise, p is passed in to this functionif ((p <= 0) or (p >= 0.15)) set p to 1.0 -- don't sample if p < 1.0 andthere are multiple passes (due to histograms) materialize the sample inanother table and use that table instead

[0062] In the illustrative embodiment, the process attempts to collect5500 rows. To accomplish this, it is useful to know in advance thenumber of rows in the table. Based upon the number of rows, the samplingfraction p is established as shown in the pseudocode. If the number ofrows is not known, then estimate this value. Certain thresholds can beestablished for the sampling fraction, beyond which the samplingfraction is set to 1. Under certain circumstances, it may make sense tocreate another table to hold the sampled data from the column. Forexample, if multiple passes are needed, e.g., because histograms are tobe collected, then the samples are materialized into a table to preventrepeated accesses to the larger base table.

[0063] The execute_basic( ) function builds up one or more queries toretrieve sampled data and calculates the desired statistics (excludinghistograms in an embodiment). The one or more queries are then executedto retrieve the results for evaluation, as set forth below. In anembodiment, the one or more queries samples rows from the table basedupon the sampling fraction p that was previously established.

[0064] The evaluate_basic( ) function determines whether the number ofrows sampled according to the sampling fraction p can be adequatelyscaled upward for the entire table.

[0065] The following comprises pseudocode for an embodiment of theevaluate_basic( ) function: if(p < 1.0) if (s < 2500) -- too small asample bump up p accordingly n s / p for each column clear allnon-histogram bits that indicate which statistics to collect if nv bitwas set nv = n − snnv / p if (avg, min, or max bit was set) and (snnv <919) bump up p accordingly if ndvbit was set try to scale it up * ifcannot scale upward bump up p accordingly set ndv again for next passelse -- this was not an estimate set all requested statistics

[0066] set avg, min, and max bits again for next pass

[0067] The pseudocode first checks that at least 2500 rows were sampledbased upon the current sampling fraction (p). If not, then the samplingfraction is adjusted upward and the table is re-sampled. If a sufficientnumber of rows has been collected, then the number of rows (n) isestimated based upon the following: n=s/p, where s represents the numberof rows that have been collected.

[0068] For the average length, minimum, and maximum column valuestatistics (avg, min, max), the pseudocode checks that at least 919non-null column values (snnv) are detected in the sample. If so, thenthese values are considered adequate for the entire table. If not, thenthe sampling fraction p is increased for the next pass through thetable.

[0069] For the number of distinct values statistic (ndv), the pseudocodeattempts to scale this statistic up for the entire table. If thestatistic based upon the sampled rows cannot be scaled upward, then thesampling fraction is increased for the next pass through the table.

[0070] The following comprises pseudocode for scaling ndv and density(defined below) statistics according to an embodiment of the invention:sdiv := sndv / snnv if((snnv< 100) or ((snnv >= 100) and (snnv < 500)and (sdiv > 0.3299)) or ((snnv >= 500) and (snnv < 1000) and (sdiv>0.4977)) or ((snnv >= 1000) and (snnv <2000) and (sdiv> 0.58 17)) or((snnv >= 2000) and (snnv < 5000) and (sdiv > 0.6634)) or ((snnv >=5000) and (snnv < 10000) and (sdiv> 0.7584)) or ((snnv >= 10000) and(snnv < 1000000) and (sdiv> 0.8 169)) or ((snnv >= 1000000) and (sdiv >0.9784))) cannot reliable use kkesdv to scale the value else can usekkesdv scaling reliably nnv := snnv / p if ((sndv = snnv) and • ((snnv >29472) or ((nnv < 10000) and (snnv > 708)) or ((nnv < 40000) and(snnv >= 1813)) or ((nnv < 160000) and (snnv >= 4596)) or ((nnv <640000) and (snnv >= 11664)))) then can use linear scaling reliably --ndv := sndv * 1/p else cannot reliably use linear scaling to scale thevalue

[0071] The following comprises pseudocode for an embodiment of thekkesdr scaling function: x1 :=sndv x2 :=nnv stay_loop := true while(stay_loop and (x1 < x2) x := floor( (x2+x1)/2) y2 : x *(1−power(1−(1/x), snnv ) ) if (sndv < y2) x2 :x−1 elseif (sndv > y2) x1: x+1 else stay_loop := false ndv := x

[0072] The execute_hist( ) function determines whether a histogramshould be collected. The following comprises pseudocode for anembodiment of the execute_hist( ) function: for each column with thehistogram bit set if ((p < 1.0) and (snnv <2500)) not enough data --bump up p for next pass accordingly else if # buckets specified via aninteger or repeat set mnb to that value else set mnb to min(75,(max(200,snnv/26))) estimate the ndv based on prior information if available if(estimated ndv < (mnb * 0.75)) -- probably a frequency histogramexecute_frequency( ) if still need to collect histogram execute_height()

[0073] As before, the pseudocode checks whether 2500 rows have beencollected during the sampling process. If not, then the samplingfraction (p) is increased for the next pass through the table. Themaximum number of buckets (mnb) is set as shown in the pseudocode. Thenumber of distinct values (ndv) is estimated, possibly based upon aprevious pass through the table and the prior execution of theevaluate_basic( ) function.

[0074] If it is desired to collect a histogram and the estimated ndvvalue is below a given threshold (mnb*0.75), then a frequency histogramis generated in an embodiment. A frequency histogram is oftenappropriate for a column having a small number of distinct values. In afrequency histogram, the endpoints of multiple buckets have the sameendpoint value (because the same value entry is in multiple buckets).For this reason, buckets having the same endpoint values often do notneed an explicitly expressed endpoint. This provides one or more “bucketgaps” in the histogram that allows comparatively cheap storage andcompressed representation of such frequency histograms. If this type ofdata distribution is identified, then the process preferably creates afrequency histogram using the execute_frequency( ) function. If it isdesired to collect a histogram and the ndv value is greater than anestablished threshold, then the procedure generates a height-balancedhistogram using the execute_height( ) function in an embodiment of theinvention.

[0075] The following comprises pseudocode for an embodiment of theexecute_frequency( ) function: build up frequency query and execute itif (ndv <= mnb) -- have a good frequency histogram clear histogramcollection bit

[0076] The following pseudocode can be used to build up a frequencyquery according to an embodiment of the invention: select c, count(*)from t sample (s) where c is not null group by c order by c;

[0077] This query collects column values from a table and performs acount of the values.

[0078] The following comprises pseudocode for an embodiment of theexecute_height( ) function: build up a height-balanced query and executeit check for non-uniformity if non-uniformity exists try to scale themultiplicative inverse of the density if it can be successfully scaled-- histogram is ready clear histogram collection bit else clearhistogram collection bit -- no histogram needed

[0079] In this pseudocode, the column values are checked fornon-uniformity. If the column values are uniform, then no histogram iscollected. Otherwise, the pseudocode attempts to scale themultiplicative inverse of the density using the previously describedprocess for scaling ndv. In prior evaluations, the number of repetitionswas considered uniform over the values; but once histograms areintroduced, the popular values can be removed to remove influence uponnon-popular values in the histogram.

[0080] According to an embodiment, a popular value is a value thatcorresponds to more than one endpoint in a height-balanced histogram.All values that are not popular are considered non-popular. Density isthe expected number of repeated occurrences of a non-popular value. Inone embodiment, density can be calculated as the sum of the square ofthe repetition counts for non-popular values divided by the product ofthe number of rows in the table and the number of non-popular values inthe table.

[0081] The following comprises pseudocode for building up aheight-balanced query according to one embodiment of the invention:select maxbkt, min(value) minval, max(value) maxval, sum(rep) sumrep,sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv from( selectvalue, max(bkt) maxbkt, count(value) rep, count(value)*count(value)repsq from ( select c as value, ntile(mnb) over (order by c) bkt from tsample(s) where c is not null ) group by value; ) group by maxbkt orderby maxbkt;

[0082] Here, the inner select statement calls an ntile( ) function,which creates a height-balanced histogram and places data sample valuesinto appropriate buckets in the histogram. In an embodiment, such afunction creates an uncompressed histogram and returns a numberrepresenting the bucket that a value falls into. The repetition counts(and square of repetition counts) are selected in the middle statement.The outer loop performs a count and checks the values and buckets forthe result set. The max and min values for the buckets are reviewed toobtain the histogram endpoints.

[0083] Density, which is related to the selectivity of non-popularvalues in the data sample, is calculated in this procedure using thefunction results from the outer loop. This is computed in an embodimentby looking at the number of repetitions of a non-popular value.

[0084] The result of this query is that one row is obtained per bucket,with missing buckets coming from the more popular values. Each row willhave the minimum and maximum value for that bucket, along with thenumber of rows in that bucket, the sum of the repetition counts andsquare of the repetition counts for rows in that bucket, and the numberof repetitions for the most popular value in that bucket. All missingbuckets have been folded into the nearest bucket that is larger.

[0085] For example, consider if the process ends up with the following:maxbkt minval maxval sumrep sumrepsq maxrep 1 1 2 2 2 1 4 3 4 9 65 8 5 55 3 9 3 6 6 8 4 6 2 8 9 10 6 20 4

[0086] This would mean that the number 3 is popular because it is thelargest value in the missing buckets 2 and 3. Notice that the number 9is not a popular value because it is only the largest value of a singlebucket, bucket 7, and thus would only appear once as a histogramendpoint. To calculate density, the influence of the popular value, 3,would be removed. Since the value 3 appears 8 times, the number 8 issubtracted from the sumrep sum and 64 (square of 8) from the sumrepsqsum. This enables the computation at a density which is based upon thenumber of rows in the table, the number of non-popular values in thecolumn, and the sum of the square of the repetition counts ofnon-popular values.

[0087] The following pseudocode provides an illustrative embodiment ofthe invention for histogram determination that was generally describedwith respect to FIG. 4. for each column, c, for which a histogram isconsidered if the user has specified size create and save a histogramwith number of buckets = requested size else if the user has specifiedsize repeat if c already has a histogram with b buckets create and savea histogram with b buckets else if the user has specified size skewonlycreate a histogram if the created histogram exhibits equality or rangeskew save it in the dictionary else if the user has specified size autocheck the dictionary for column usage information if c has been in apredicate involving an equality, range, or like create a histogram if cappeared in an equality (including equijoin) predicate if the histogramexhibits non-uniformity in value repetition save it in the dictionary ifc appeared in a like or range predicate (not involving join) if thehistogram exhibits non-uniformity in range save it in the dictionary anyprior histogram on c will be removed

[0088] The first portion of the pseudocode relates to specificinstructions from a user to create a histogram, which results in thecreation of the desired histogram. The specific histogram is createdwithout a determination as to whether it is actually needed.Alternatively, the invention can be adapted to automatically checkwhether a histogram specifically called for by a user should actually becollected and/or saved.

[0089] The second portion of the pseudocode relates to automateddetermination of histogram collection. In this illustrative embodiment,the following items of information are utilized for histogramdetermination: 1) the subset of columns for which the user wants togather statistics; 2) the columns which already have histograms createdfor them; 3) column usage information; and 4) the distribution of data,e.g., as seen in a data sample. Because data distribution information isinvolved, this process may be advantageously used in conjunction withthe process of FIG. 2 for automated sample size determination.

[0090] In the illustrative embodiment, column usage information isconsidered in conjunction with data distribution information for thatcolumn to determine whether a histogram should be collected and stored.Column usage information includes, for example, the type of predicatesthat is executed against the column. The data skew of the column isevaluated against the type of predicate for that column to determinewhether a histogram is needed.

[0091] When parsing a statement for the first time in an embodiment, thecost-based optimizer looks at the statistics on all of the objects(tables, columns, etc.) involved in the statement. For each column inthe where clause, it will estimate the selectivity of the predicateinvolving that column. At this point, the system will make an entry inthe data structure for the column indicating what type of predicate itwas involved in.

[0092] It an embodiment, column usage information is collected everytime a user hard-parses a statement, in which a bit is marked in memoryfor the column usage information. Whenever information is flushed todisk, these bits indicate whether to increment the appropriatedictionary columns. For example, if a query containing a column with arange predicate was hard parsed since the last flush, the system willincrement the range_predicate counter for that column when the nextflush procedure takes place, as well as updating the timestamp. Onereason for using counters on disk is to provide a better feel for theimportance of the predicate. Counters can also be used in memory, butmay result in expensive overhead.

[0093] In the illustrative pseudocode, a histogram is created if thecolumn is involved in equality, range, or like predicates. In anembodiment, the histogram is created based on a sampled portion of thecolumn, and is preferably created using a small sample of the entirepopulation that is sufficient to both determine the need for histogramsand produce histograms which are representative of the entirepopulation. The number of bucketsin the histogram could be based on thesample size. The range max and min is selected based upon the datasamples. Values in the data samples are placed into the selectedbuckets. The process then counts the number of equi-height endpointsthat fall within the equi-width buckets. The buckets are reviewed todetermine if any buckets are overly large or small. If so, then it islikely that the column does not have uniform data distribution, therebyindicating range skew. In addition, the act of creating a histogram alsoprovides an estimate for the number of distinct values, providing anextra benefit even if the histogram is later discarded.

[0094] If an equality or equijoin predicate is involved, then thehistogram is saved only if the histogram exhibits non-uniformity invalue repetition. For purposes of this example, a column will beconsidered to have non-uniform value repetition if any value is popular,e.g., repeats as an endpoint in the histogram.

[0095] If a like or range predicate is involved, then the histogram issaved only if the histogram exhibits non-uniformity in range. In oneembodiment, a column is considered to have non-uniformity in range if itpasses the following test:

[0096] given that the created histogram had b equi-height buckets

[0097] divide the range (max-min) into b equi-width buckets

[0098] sum =0

[0099] for each equiwidth bucket

[0100] count the number of equi-height endpoints that fall in the bucketsum+=(count*count)

[0101] if (sum/b)>1.7 this column is considered to be non-uniform inrange

[0102] For a uniform column, the equi-height endpoints would coincidewith the equi-width endpoints, and the sum would simply be b.

SYSTEM ARCHITECTURE OVERVIEW

[0103] Referring to FIG. 5, in an embodiment, a computer system 520includes a host computer 522 connected to a plurality of individual userstations 524. In an embodiment, the user stations 524 each comprisesuitable data terminals, for example, but not limited to, e.g., personalcomputers, portable laptop computers, or personal data assistants(“PDAs”), which can store and independently run one or moreapplications, i.e., programs. For purposes of illustration, some of theuser stations 524 are connected to the host computer 522 via a localarea network (“LAN”) 526. Other user stations 524 are remotely connectedto the host computer 522 via a public telephone switched network(“PSTN”) 528 and/or a wireless network 530.

[0104] In an embodiment, the host computer 522 operates in conjunctionwith a data storage system 531, wherein the data storage system 531contains a database 532 that is readily accessible by the host computer522. Note that a multiple tier architecture can be employed to connectuser stations 524 to a database 532, utilizing for example, a middleapplication tier (not shown). In alternative embodiments, the database532 may be resident on the host computer, stored, e.g., in the hostcomputer's ROM, PROM, EPROM, or any other memory chip, and/or its harddisk. In yet alternative embodiments, the database 532 may be read bythe host computer 522 from one or more floppy disks, flexible disks,magnetic tapes, any other magnetic medium, CD-ROMs, any other opticalmedium, punchcards, papertape, or any other physical medium withpatterns of holes, or any other medium from which a computer can read.In an alternative embodiment, the host computer 522 can access two ormore databases 532, stored in a variety of mediums, as previouslydiscussed.

[0105] Referring to FIG. 6, in an embodiment, each user station 524 andthe host computer 522, each referred to generally as a processing unit,embodies a general architecture 605. A processing unit includes a bus606 or other communication mechanism for communicating instructions,messages and data, collectively, information, and one or more processors607 coupled with the bus 606 for processing information. A processingunit also includes a main memory 608, such as a random access memory(RAM) or other dynamic storage device, coupled to the bus 606 forstoring dynamic data and instructions to be executed by the processor(s)607. The main memory 608 also may be used for storing temporary data,i.e., variables, or other intermediate information during execution ofinstructions by the processor(s) 607. A processing unit may furtherinclude a read only memory (ROM) 609 or other static storage devicecoupled to the bus 606 for storing static data and instructions for theprocessor(s) 607. A storage device 610, such as a magnetic disk oroptical disk, may also be provided and coupled to the bus 606 forstoring data and instructions for the processor(s) 607.

[0106] A processing unit may be coupled via the bus 606 to a displaydevice 611, such as, but not limited to, a cathode ray tube (CRT), fordisplaying information to a user. An input device 612, includingalphanumeric and other columns, is coupled to the bus 606 forcommunicating information and command selections to the processor(s)607. Another type of user input device may include a cursor control 613,such as, but not limited to, a mouse, a trackball, a fingerpad, orcursor direction columns, for communicating direction information andcommand selections to the processor(s) 607 and for controlling cursormovement on the display 611.

[0107] According to one embodiment of the invention, the individualprocessing units perform specific operations by their respectiveprocessor(s) 607 executing one or more sequences of one or moreinstructions contained in the main memory 608. Such instructions may beread into the main memory 608 from another computer-usable medium, suchas the ROM 609 or the storage device 610. Execution of the sequences ofinstructions contained in the main memory 608 causes the processor(s)607 to perform the processes described herein. In alternativeembodiments, hard-wired circuitry may be used in place of or incombination with software instructions to implement the invention. Thus,embodiments of the invention are not limited to any specific combinationof hardware circuitry and/or software.

[0108] The term “computer-usable medium,” as used herein, refers to anymedium that provides information or is usable by the processor(s) 607.Such a medium may take many forms, including, but not limited to,non-volatile, volatile and transmission media. Non-volatile media, i.e.,media that can retain information in the absence of power, includes theROM 609. Volatile media, i.e., media that can not retain information inthe absence of power, includes the main memory 608. Transmission mediaincludes coaxial cables, copper wire and fiber optics, including thewires that comprise the bus 606. Transmission media can also take theform of carrier waves; i.e., electromagnetic waves that can bemodulated, as in frequency, amplitude or phase, to transmit informationsignals. Additionally, transmission media can take the form of acousticor light waves, such as those generated during radio wave and infrareddata communications.

[0109] Common forms of computer-usable media include, for example: afloppy disk, flexible disk, hard disk, magnetic tape, any other magneticmedium, CD-ROM, any other optical medium, punchcards, papertape, anyother physical medium with patterns of holes, RAM, ROM, PROM (i.e.,programmable read only memory), EPROM (i.e., erasable programmable readonly memory), including FLASH-EPROM, any other memory chip or cartridge,carrier waves, or any other medium from which a processor 607 canretrieve information. Various forms of computer-usable media may beinvolved in providing one or more sequences of one or more instructionsto the processor(s) 607 for execution. The instructions received by themain memory 608 may optionally be stored on the storage device 610,either before or after their execution by the processor(s) 607.

[0110] Each processing unit may also include a communication interface614 coupled to the bus 606. The communication interface 614 providestwo-way communication between the respective user stations 624 and thehost computer 622. The communication interface 614 of a respectiveprocessing unit transmits and receives electrical, electromagnetic oroptical signals that include data streams representing various types ofinformation, including instructions, messages and data. A communicationlink 615 links a respective user station 624 and a host computer 622.The communication link 615 may be a LAN 526, in which case thecommunication interface 614 may be a LAN card. Alternatively, thecommunication link 615 may be a PSTN 528, in which case thecommunication interface 614 may be an integrated services digitalnetwork (ISDN) card or a modem. Also, as a further alternative, thecommunication link 615 may be a wireless network 530. A processing unitmay transmit and receive messages, data, and instructions, includingprogram, i.e., application, code, through its respective communicationlink 615 and communication interface 614. Received program code may beexecuted by the respective processor(s) 607 as it is received, and/orstored in the storage device 610, or other associated non-volatilemedia, for later execution. In this manner, a processing unit mayreceive messages. data and/or program code in the form of a carrierwave.

1. A method for collecting information used by an optimizer in adatabase system, comprising: receiving a request to collect a statisticfor a database object; automatically selecting a sample size foraccessing the database object; collecting a sampled statistic using thesample size for accessing the database object; and scaling the sampledstatistic for the data object as appropriate for the sample size andtype of statistic being collected.
 2. The method of claim 1 in which thestep of automatically selecting the sample size comprises an iterativeprocedure for increasing the amount of the data object until the sampledstatistic is deemed acceptable.
 3. The method of claim 2 in which thesampled statistic is deemed acceptable if the sampled statistic can bescaled for the entire data object.
 4. The method of claim 1 in which thestatistic comprises the number of rows in a database table.
 5. Themethod of claim 4 further comprising determining if the sample sizeincludes 2500 or more rows of data.
 6. The method of claim 1 in whichthe statistic is elected from the group consisting of: average columnlength, maximum value, minimum value.
 7. The method of claim 6 furthercomprising determining if the sample size includes at least 919 or morerows of data.
 8. The method of claim 1 in which the sample size isexpressed as a sampling fraction.
 9. The method of claim 8 in which thesampling fraction is independently evaluated against each individualunit in the data object.
 10. The method of claim 1 in which thestatistic comprises a histogram.
 11. The method of claim 1 in which thesample size is selected to attempt retrieval of at least 5500 units ofthe data object.
 12. A computer program product that includes acomputer-usable medium comprising a sequence of instructions which, whenexecuted by a processor, causes said processor to execute a process forcollecting information used by an optimizer in a database system, saidprocess comprising: receiving a request to collect a statistic for adatabase object; automatically selecting a sample size for accessing thedatabase object; collecting a sampled statistic using the sample sizefor accessing the database object; and scaling the sampled statistic forthe data object as appropriate for the sample size and type of statisticbeing collected.
 13. The computer program product of claim 12 in whichthe step of automatically selecting the sample size comprises aniterative procedure for increasing the amount of the data object untilthe sampled statistic is deemed acceptable.
 14. The computer programproduct of claim 13 in which the sampled statistic is deemed acceptableif the sampled statistic can be scaled for the entire data object. 15.The computer program product of claim 12 in which the statisticcomprises the number of rows in a database table.
 16. The method ofclaim 15 further comprising determining if the sample size includes 2500or more rows of data.
 17. The computer program product of claim 12 inwhich the statistic is elected from the group consisting of: averagecolumn length, maximum value, minimum value.
 18. The computer programproduct of claim 17 further comprising determining if the sample sizeincludes at least 919 or more rows of data.
 19. The computer programproduct of claim 12 in which the sample size is expressed as a samplingfraction.
 20. The computer program product of claim 19 in which thesampling fraction is independently evaluated against each individualunit in the data object.
 21. The computer program product of claim 12 inwhich the statistic comprises a histogram.
 22. The computer programproduct of claim 12 in which the sample size is selected to attemptretrieval of at least 5500 units of the data object.
 23. A system forcollecting information used by an optimizer in a database system,comprising: means for receiving a request to collect a statistic for adatabase object; means for automatically selecting a sample size foraccessing the database object; means for collecting a sampled statisticusing the sample size for accessing the database object; and means forscaling the sampled statistic for the data object as appropriate for thesample size and type of statistic being collected.
 24. The system ofclaim 23 in which the means for automatically selecting the sample sizecomprises means for an iterative procedure for increasing the amount ofthe data object until the sampled statistic is deemed acceptable. 25.The system of claim 24 in which the sampled statistic is deemedacceptable if the sampled statistic can be scaled for the entire dataobject.
 26. The system of claim 23 in which the statistic comprises thenumber of rows in a database table.
 27. The method of claim 26 furthercomprising determining if the sample size includes 2500 or more rows ofdata.
 28. The system of claim 23 in which the statistic is elected fromthe group consisting of: average column length, maximum value, minimumvalue.
 29. The system of claim 28, further comprising determining thesample size is at least 919 rows or more of data.
 30. The system ofclaim 23 in which the sample size is expressed as a sampling fraction.31. The system of claim 30 in which the sampling fraction isindependently evaluated against each individual unit in the data object.32. The system of claim 23 in which the statistic comprises a histogram.33. The system of claim 23 in which the sample size is selected toattempt retrieval of at least 5500 units of the data object.